Skip to content

fix(bots): skip bot upsert when nothing changed to stop team-strip + reindex loop on boot#28128

Open
joaopamaral wants to merge 2 commits into
open-metadata:mainfrom
Automattic:bot-team-strip-fix
Open

fix(bots): skip bot upsert when nothing changed to stop team-strip + reindex loop on boot#28128
joaopamaral wants to merge 2 commits into
open-metadata:mainfrom
Automattic:bot-team-strip-fix

Conversation

@joaopamaral
Copy link
Copy Markdown

Summary

BotResource.initialize() runs UserUtil.addOrUpdateBotUser(user) for every bot on every OM boot. The in-memory User built by UserUtil.user(...) does not have the teams field populated, so the PUT path through userRepository.createOrUpdate -> UserUpdater.entitySpecificUpdate runs updateTeams(original, updated) with original.teams = [Organization] (or the bot's real stored teams) and updated.teams = null. updateTeams then executes:

deleteTo(original.getId(), USER, Relationship.HAS, Entity.TEAM);
assignTeams(updated, updated.getTeams()); // updated.getTeams() == null

…which strips every stored team membership the bot had, bumps the user version, and triggers an Elasticsearch reindex of both the user and each affected team. With several bots this fires on every restart and produces the reindex storm we observed in production logs:

storeNewVersion called for entity: user <bot-id>,
  changeDescription=fieldsDeleted=[
    FieldChange[name=teams,
                oldValue=[{"id":"...","name":"Organization",...}],
                newValue=<null>]]

Repeated for profiler-bot, governance-bot, usage-bot, ingestion-bot, … each one taking ~100–200 ms plus a team-side reindex. We also saw Circular dependency detected in team hierarchy for team: Organization. Skipping to prevent StackOverflowError. from SubjectContext during the same window — the boot-time team churn was tripping the cycle guard.

In a real environment with several bots this added ~3 minutes to every boot.

Fix

Short-circuit addOrUpdateBotUser when the incoming bot has no real change vs. the persisted row: compare description, displayName, and roles. If they all match, return the original user and skip the PUT entirely — no UserUpdater, no team strip, no version bump, no reindex.

Two small adjustments to make the guard actually fire:

  • retrieveWithAuthMechanism now also loads \"roles\" (was loading only \"authenticationMechanism\"). description and displayName are scalar JSON-column fields and were already populated by the base read.
  • Compare roles via listOrEmpty(...) on both sides because the database-loaded original returns an empty list while the freshly built in-memory user returns null, and Objects.equals(null, []) is false.

The first call still hits the existing code path (no originalUser -> guard skipped), so seeding new bots is unchanged.

Reproducer

  1. Boot OpenMetadata. The standard set of bots (ingestion-bot, profiler-bot, governance-bot, usage-bot, …) is upserted by BotResource.initialize().
  2. As an admin, assign one of those bots to a real (non-Organization) team.
  3. Restart OpenMetadata.
  4. Observed (before this fix): the bot is no longer in that team, only Organization. Boot logs show fieldsDeleted=[name=teams, oldValue=[...], newValue=null] for every bot, each followed by user + team ES reindex log lines.
  5. After this fix: the bot still belongs to the assigned team after the restart. None of those fieldsDeleted=[teams] log lines appear. Boot time drops accordingly (~3 minutes saved in production).

Test plan

Added openmetadata-service/src/test/java/org/openmetadata/service/util/UserUtilBotTest with two Mockito unit tests:

  • addOrUpdateBotUserShortCircuitsWhenNothingChanged — mocks UserRepository, stubs getByName to return a stored bot whose roles/description/displayName match the incoming user, calls addOrUpdateBotUser(boundary), asserts the return is the same instance as the stored user, and verifies userRepository.createOrUpdate(...) was never called. Verified to fail before the fix (test reaches UserUtil.addOrUpdateUser and explodes on the unstubbed createOrUpdate); passes after the fix.
  • addOrUpdateBotUserGoesThroughUpsertWhenDisplayNameChanged — same setup but with mismatching displayName; verifies userRepository.createOrUpdate(...) is called, so we don't accidentally short-circuit on real changes.

Local manual verification: spun the fix into the 1.12.7 backport branch we run in production, restarted, observed no fieldsDeleted=[teams] log entries for the bot users and no follow-up bot/team reindex log lines. Boot duration dropped by ~3 minutes.

Opening as draft for maintainer feedback on:

  • whether the comparison set (roles, description, displayName) is acceptable or you'd prefer a broader/narrower field check;
  • the chosen test style (Mockito unit test) vs. a different pattern you'd rather see.

🤖 Generated with Claude Code

@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@joaopamaral joaopamaral force-pushed the bot-team-strip-fix branch from f5365ec to 71c6156 Compare May 18, 2026 14:26
@joaopamaral
Copy link
Copy Markdown
Author

Addressed all four bot review comments in 71c6156a72:

  1. Imports — added AuthenticationMechanism / AuthType imports, dropped the inline FQNs at the old lines 66-67 and 108-109.
  2. Swallowed RuntimeException — replaced the catch (RuntimeException ignored) flow-control pattern with a real createOrUpdate stub that returns a PutResponse so addOrUpdateUser completes normally and the assertNotEquals(stored, result, ...) assertion does the work.
  3. Unused roleRef — removed.
  4. email not in short-circuit comparison — added Objects.equals(originalUser.getEmail(), user.getEmail()) to the guard with a short comment explaining the case (an admin changing principalDomain between restarts → UserUtil.user(...) derives a different email → must not be skipped). Both unit tests now set the email on stored and incoming and still pass; without the fix on the underlying method the test still fails (verified by reverting UserUtil.java to upstream/main, running the test → 1 error; restoring → both green).

@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@joaopamaral joaopamaral force-pushed the bot-team-strip-fix branch from 71c6156 to c85a520 Compare May 18, 2026 14:36
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@joaopamaral joaopamaral marked this pull request as ready for review May 18, 2026 14:38
Copilot AI review requested due to automatic review settings May 18, 2026 14:38
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR prevents a boot-time loop where bot users are upserted on every restart, which (because the in-memory bot User lacks teams) triggers updateTeams(...) to remove existing bot team memberships and causes repeated version bumps + search reindex storms. The fix adds a no-op short-circuit in UserUtil.addOrUpdateBotUser(...), and adds unit tests to ensure the short-circuit behavior.

Changes:

  • Add a short-circuit in UserUtil.addOrUpdateBotUser to skip createOrUpdate when key bot fields are unchanged.
  • Ensure retrieveWithAuthMechanism also loads roles so the short-circuit can compare them.
  • Add Mockito-based unit tests covering both the no-op path and the upsert path.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
openmetadata-service/src/main/java/org/openmetadata/service/util/UserUtil.java Adds the short-circuit guard and expands the fields loaded for bot comparison.
openmetadata-service/src/test/java/org/openmetadata/service/util/UserUtilBotTest.java Adds unit tests validating short-circuit vs. upsert behavior.

yan-3005
yan-3005 previously approved these changes May 18, 2026
@yan-3005 yan-3005 added the safe to test Add this label to run secure Github workflows on PRs label May 18, 2026
@github-actions
Copy link
Copy Markdown
Contributor

The Java checkstyle failed.

Please run mvn spotless:apply in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Java code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 18, 2026

🟡 Playwright Results — all passed (14 flaky)

✅ 4056 passed · ❌ 0 failed · 🟡 14 flaky · ⏭️ 103 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 298 0 1 4
🟡 Shard 2 743 0 7 25
🟡 Shard 3 782 0 2 7
🟡 Shard 4 788 0 2 18
✅ Shard 5 709 0 0 41
🟡 Shard 6 736 0 2 8
🟡 14 flaky test(s) (passed on retry)
  • Features/TeamsHierarchy.spec.ts › Delete Parent Team (shard 1, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/BulkImport.spec.ts › Table (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should start term as Draft when glossary has reviewers (shard 2, 1 retry)
  • Features/KnowledgeCenterList.spec.ts › Knowledge Center List - Test infinite scroll/pagination (shard 2, 1 retry)
  • Features/KnowledgeCenterTextEditor.spec.ts › Rich Text Editor - Text Formatting (shard 2, 1 retry)
  • Features/KnowledgeCenterTextEditor.spec.ts › Rich Text Editor - Text Formatting (shard 2, 1 retry)
  • Features/KnowledgeCenterTextEditor.spec.ts › Rich Text Editor - Text Formatting (shard 2, 1 retry)
  • Features/Table.spec.ts › Table search with sorting should work (shard 3, 1 retry)
  • Flow/SchemaTable.spec.ts › schema table test (shard 3, 1 retry)
  • Pages/CustomProperties.spec.ts › Set & Update all CP types on apiCollection (shard 4, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for Worksheet (shard 4, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/ServiceEntity.spec.ts › Announcement create, edit & delete (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

@joaopamaral
Copy link
Copy Markdown
Author

Checking failing tests 👀

…reindex loop on boot

`BotResource.initialize()` runs `UserUtil.addOrUpdateBotUser(user)` for
every bot on every OM boot. The in-memory `User` built by
`UserUtil.user(...)` does not have the `teams` field populated, so the
PUT path through `userRepository.createOrUpdate ->
UserUpdater.entitySpecificUpdate` runs `updateTeams(original, updated)`
with `original.teams = [Organization]` (or the bot's real stored teams)
and `updated.teams = null`. `updateTeams` then executes
`deleteTo(user, HAS, TEAM) + assignTeams(null)`, which strips every
stored team membership the bot had, bumps the user version, and triggers
an Elasticsearch reindex of both the user and each affected team. With
several bots this fires on every restart and produces the reindex storm
plus "Circular dependency detected in team hierarchy for team:
Organization" warnings in the boot logs. In one production deployment
this added almost 3 minutes to every boot.

Short-circuit when the incoming bot has no real change vs. the persisted
row: compare `description`, `displayName`, and `roles`. If they all
match, return the original user and skip the PUT entirely — no
`UserUpdater`, no team strip, no version bump, no reindex.

Two adjustments to make the guard actually fire:
- `retrieveWithAuthMechanism` now also loads `"roles"` (was loading only
  `"authenticationMechanism"`); `description` and `displayName` are
  scalar JSON-column fields and were already populated by the base read.
- Compare `roles` via `listOrEmpty(...)` on both sides because the
  database-loaded original returns an empty list while the freshly built
  in-memory user returns null, and `Objects.equals(null, [])` is false.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment on lines +335 to +341
// Short-circuit when the incoming bot user has no real change vs. what's already in the
// database. Without this guard every OM boot calls into addOrUpdateUser ->
// userRepository.createOrUpdate, and UserUpdater.entitySpecificUpdate then runs
// updateTeams/updateRoles/etc. with the incoming `user.getTeams() == null`, which strips
// the bot's stored team relationships, bumps the version, and triggers an Elasticsearch
// reindex of the bot user (and any team membership change ripples into the team_search
// index too). With many bots this is a reindex storm on every restart.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Quality: Overly verbose inline comment block in production code

The 7-line comment block (lines 335–341) explains the why behind the entire PR rather than documenting non-obvious behaviour at the call site. The PR description already captures the rationale. Per the project's "write self-documenting code; only document complex business logic or workarounds" guideline, this would be cleaner as a 1–2 line comment (e.g., // Skip upsert when nothing changed — avoids stripping teams/triggering reindex on boot). The rest is commit-message / PR-description material.

Was this helpful? React with 👍 / 👎

@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented May 18, 2026

Code Review 👍 Approved with suggestions 5 resolved / 6 findings

Prevents redundant bot upserts during boot by adding a short-circuit comparison for bot attributes, successfully resolving the team-strip and reindexing loop. Please consider condensing the verbose inline comment block in the production logic.

💡 Quality: Overly verbose inline comment block in production code

📄 openmetadata-service/src/main/java/org/openmetadata/service/util/UserUtil.java:335-341

The 7-line comment block (lines 335–341) explains the why behind the entire PR rather than documenting non-obvious behaviour at the call site. The PR description already captures the rationale. Per the project's "write self-documenting code; only document complex business logic or workarounds" guideline, this would be cleaner as a 1–2 line comment (e.g., // Skip upsert when nothing changed — avoids stripping teams/triggering reindex on boot). The rest is commit-message / PR-description material.

✅ 5 resolved
Quality: Fully qualified class names used instead of imports

📄 openmetadata-service/src/test/java/org/openmetadata/service/util/UserUtilBotTest.java:66-67 📄 openmetadata-service/src/test/java/org/openmetadata/service/util/UserUtilBotTest.java:108-109
The test uses fully qualified names org.openmetadata.schema.entity.teams.AuthenticationMechanism inline (lines 66-67 and 108-109) instead of importing the class. Per project conventions, wildcard imports and fully qualified names should be avoided — add a proper import statement instead.

Quality: Swallowed RuntimeException uses flow-control exception pattern

📄 openmetadata-service/src/test/java/org/openmetadata/service/util/UserUtilBotTest.java:125-135
The test at line 131 catches RuntimeException ignored to handle the case where the mock doesn't fully stub the downstream path. This is a flow-control exception anti-pattern and makes the test fragile — if addOrUpdateBotUser throws for an unexpected reason, the test silently passes. Consider stubbing createOrUpdate to return a value (or use doNothing()/doReturn(...)) so the method completes normally, and assert on the result instead.

Quality: Unused private helper method roleRef should be removed

📄 openmetadata-service/src/test/java/org/openmetadata/service/util/UserUtilBotTest.java:140-145
The private method roleRef at lines 140-145 is annotated with @SuppressWarnings("unused") and is never called. Commented-out or dead code should be removed per project conventions. If it's intended for future tests, add it when those tests are written.

Edge Case: Short-circuit doesn't compare email field changes

📄 openmetadata-service/src/main/java/org/openmetadata/service/util/UserUtil.java:342-346
The guard at line 342-346 compares roles, description, and displayName but not email. The UserUtil.user(...) method (which builds the in-memory bot user) may set an email based on domain. If an admin changes the domain configuration between restarts, the email update would be silently skipped by the short-circuit. Consider whether email should be included in the comparison, or document why it's excluded.

Quality: Fully qualified class names instead of imports in test

📄 openmetadata-service/src/test/java/org/openmetadata/service/util/UserUtilBotTest.java:128-130
Lines 128-130 of UserUtilBotTest.java use fully qualified class names (jakarta.ws.rs.core.Response.Status.OK and org.openmetadata.schema.type.EventType.ENTITY_UPDATED) inline instead of importing them. Per project conventions, wildcard and fully qualified names should be avoided — add proper imports at the top of the file.

🤖 Prompt for agents
Code Review: Prevents redundant bot upserts during boot by adding a short-circuit comparison for bot attributes, successfully resolving the team-strip and reindexing loop. Please consider condensing the verbose inline comment block in the production logic.

1. 💡 Quality: Overly verbose inline comment block in production code
   Files: openmetadata-service/src/main/java/org/openmetadata/service/util/UserUtil.java:335-341

   The 7-line comment block (lines 335–341) explains the *why* behind the entire PR rather than documenting non-obvious behaviour at the call site. The PR description already captures the rationale. Per the project's "write self-documenting code; only document complex business logic or workarounds" guideline, this would be cleaner as a 1–2 line comment (e.g., `// Skip upsert when nothing changed — avoids stripping teams/triggering reindex on boot`). The rest is commit-message / PR-description material.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants